27 research outputs found

    Compressed multiple pattern matching

    Peer reviewed

    Lempel-Ziv Parsing for Sequences of Blocks

    The Lempel-Ziv parsing (LZ77) is a widely popular construction lying at the heart of many compression algorithms. These algorithms usually treat the data as a sequence of bytes, i.e., blocks of fixed length 8. Another common option is to view the data as a sequence of bits. We investigate the following natural question: what is the relationship between the LZ77 parsings of the same data interpreted as a sequence of fixed-length blocks and as a sequence of bits (or other “elementary” letters)? In this paper, we prove that, for any integer b > 1, the number z of phrases in the LZ77 parsing of a string of length n and the number z_b of phrases in the LZ77 parsing of the same string in which blocks of length b are interpreted as separate letters (e.g., b = 8 in case of bytes) are related as z_b = O(bz log(n/z)). The bound holds for both “overlapping” and “non-overlapping” versions of LZ77. Further, we establish a tight bound z_b = O(bz) for the special case when each phrase in the LZ77 parsing of the string has a “phrase-aligned” earlier occurrence (an occurrence equal to the concatenation of consecutive phrases). The latter is an important particular case of parsing produced, for instance, by grammar-based compression methods. © 2021 by the authors. Licensee MDPI, Basel, Switzerland. Funding: This research was funded by the Ministry of Science and Higher Education of the Russian Federation (Ural Mathematical Center project No. 075-02-2021-1387).
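
    To make the quantities z and z_b concrete, the following is a minimal sketch that counts phrases of a greedy LZ77 parsing on a sequence and on the same sequence regrouped into blocks of length b. It assumes the "overlapping" version in which a phrase is the longest prefix of the unparsed suffix with an earlier occurrence (or a single new letter); the function names `lz77_phrase_count` and `to_blocks` are illustrative and not taken from the paper, and the quadratic scan is for demonstration only.

```python
def lz77_phrase_count(seq):
    """Count phrases in a greedy LZ77 parsing (overlapping version).

    Each phrase is the longest prefix of the unparsed suffix that also
    occurs starting at an earlier position, or a single letter when no
    such prefix exists. Naive scan over all earlier positions.
    """
    n = len(seq)
    i = 0
    phrases = 0
    while i < n:
        longest = 0
        for j in range(i):  # candidate earlier starting positions
            k = 0
            # The earlier occurrence may overlap the phrase itself.
            while i + k < n and seq[j + k] == seq[i + k]:
                k += 1
            longest = max(longest, k)
        i += max(longest, 1)  # a new letter forms a phrase of length 1
        phrases += 1
    return phrases


def to_blocks(seq, b):
    """Regroup a sequence into letters that are tuples of b elements."""
    assert len(seq) % b == 0, "length must be divisible by b"
    return [tuple(seq[i:i + b]) for i in range(0, len(seq), b)]


if __name__ == "__main__":
    bits = "01" * 8
    z = lz77_phrase_count(bits)
    z_b = lz77_phrase_count(to_blocks(bits, 2))
    print(z, z_b)
```

    On this highly periodic input the bit-level parsing has 3 phrases and the 2-bit-block parsing has 2; the interesting cases for the bound are inputs where regrouping breaks the alignment of repeats.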

    Detecting One-variable Patterns

    Given a pattern $p = s_1x_1s_2x_2\cdots s_{r-1}x_{r-1}s_r$ such that $x_1,x_2,\ldots,x_{r-1}\in\{x,\overset{{}_{\leftarrow}}{x}\}$, where $x$ is a variable and $\overset{{}_{\leftarrow}}{x}$ its reversal, and $s_1,s_2,\ldots,s_r$ are strings that contain no variables, we describe an algorithm that constructs in $O(rn)$ time a compact representation of all $P$ instances of $p$ in an input string of length $n$ over a polynomially bounded integer alphabet, so that one can report those instances in $O(P)$ time. Comment: 16 pages (+13 pages of Appendix), 4 figures, accepted to SPIRE 201

    Palindromic Length of Words with Many Periodic Palindromes

    The palindromic length $\text{PL}(v)$ of a finite word $v$ is the minimal number of palindromes whose concatenation is equal to $v$. In 2013, Frid, Puzynina, and Zamboni conjectured that: If $w$ is an infinite word and $k$ is an integer such that $\text{PL}(u)\leq k$ for every factor $u$ of $w$ then $w$ is ultimately periodic. Suppose that $w$ is an infinite word and $k$ is an integer such that $\text{PL}(u)\leq k$ for every factor $u$ of $w$. Let $\Omega(w,k)$ be the set of all factors $u$ of $w$ that have more than $\sqrt[k]{k^{-1}\vert u\vert}$ palindromic prefixes. We show that $\Omega(w,k)$ is an infinite set and we show that for each positive integer $j$ there are palindromes $a,b$ and a word $u\in \Omega(w,k)$ such that $(ab)^j$ is a factor of $u$ and $b$ is nonempty. Note that $(ab)^j$ is a periodic word and $(ab)^ia$ is a palindrome for each $i\leq j$. These results justify the following question: What is the palindromic length of a concatenation of a suffix of $b$ and a periodic word $(ab)^j$ with "many" periodic palindromes? It is known that $\lvert\text{PL}(uv)-\text{PL}(u)\rvert\leq \text{PL}(v)$, where $u$ and $v$ are nonempty words. The main result of our article shows that if $a,b$ are palindromes, $b$ is nonempty, $u$ is a nonempty suffix of $b$, $\vert ab\vert$ is the minimal period of $aba$, and $j$ is a positive integer with $j\geq 3\text{PL}(u)$ then $\text{PL}(u(ab)^j)-\text{PL}(u)\geq 0$.
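
    The definition of palindromic length can be checked on small words with a short dynamic program. This is a minimal quadratic-time sketch, not an algorithm from the article; the function name `palindromic_length` is an assumption made for illustration.

```python
def palindromic_length(v):
    """Minimal number of palindromes whose concatenation equals v."""
    n = len(v)
    # is_pal[i][j] is True when v[i..j] (inclusive) is a palindrome.
    is_pal = [[False] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(i, n):
            if v[i] == v[j] and (j - i < 2 or is_pal[i + 1][j - 1]):
                is_pal[i][j] = True
    # pl[j] is the palindromic length of the prefix v[:j].
    pl = [0] + [n + 1] * n
    for j in range(1, n + 1):
        for i in range(j):
            if is_pal[i][j - 1]:
                pl[j] = min(pl[j], pl[i] + 1)
    return pl[n]


if __name__ == "__main__":
    a, b, u, j = "a", "b", "b", 3  # u is a nonempty suffix of b, j >= 3 * PL(u)
    print(palindromic_length(u), palindromic_length(u + (a + b) * j))
```

    With a = "a", b = "b", u = "b" and j = 3, the word u(ab)^j = "bababab" is itself a palindrome, so both values are 1 and the difference is 0, consistent with the stated inequality.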

    Lempel–Ziv-Like Parsing in Small Space

    Lempel–Ziv (LZ77 or, briefly, LZ) is one of the most effective and widely-used compressors for repetitive texts. However, the existing efficient methods computing the exact LZ parsing have to use linear or close to linear space to index the input text during the construction of the parsing, which is prohibitive for long inputs. An alternative is Relative Lempel–Ziv (RLZ), which indexes only a fixed reference sequence, whose size can be controlled. Deriving the reference sequence by sampling the text yields reasonable compression ratios for RLZ, but performance is not always competitive with that of LZ and depends heavily on the similarity of the reference to the text. In this paper we introduce ReLZ, a technique that uses RLZ as a preprocessor to approximate the LZ parsing using little memory. RLZ is first used to produce a sequence of phrases, and these are regarded as metasymbols that are input to LZ for a second-level parsing on a (most often) drastically shorter sequence. This parsing is finally translated into one on the original sequence. We analyze the new scheme and prove that, like LZ, it achieves the kth order empirical entropy compression nH_k + o(n log σ) with k = o(log_σ n), where n is the input length and σ is the alphabet size. In fact, we prove this entropy bound not only for ReLZ but for a wide class of LZ-like encodings. Then, we establish a lower bound on ReLZ approximation ratio showing that the number of phrases in it can be Ω(log n) times larger than the number of phrases in LZ. Our experiments show that ReLZ is faster than existing alternatives to compute the (exact or approximate) LZ parsing, at the reasonable price of an approximation factor below 2.0 in all tested scenarios, and sometimes below 1.05, to the size of LZ. © 2020, Springer Science+Business Media, LLC, part of Springer Nature. D. Kosolobov supported by the Russian Science Foundation (RSF), Project 18-71-00002 (for the upper bound analysis and a part of lower bound analysis). D. Valenzuela supported by the Academy of Finland (Grant 309048). G. Navarro funded by Basal Funds FB0001 and Fondecyt Grant 1-200038, Chile. S.J. Puglisi supported by the Academy of Finland (Grant 319454). This work started during Shonan Meeting 126 “Computation over Compressed Structured Data”. Funded in part by EU’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie Grant Agreement No. 690941 (project BIRDS).

    Pal^k is linear recognizable online

    Given a language L that is online recognizable in linear time and space, we construct a linear time and space online recognition algorithm for the language L · Pal, where Pal is the language of all nonempty palindromes. Hence for every fixed positive k, Pal^k is online recognizable in linear time and space. Thus we solve an open problem posed by Galil and Seiferas in 1978. © Springer-Verlag Berlin Heidelberg 2015
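
    For reference, membership in Pal^k (concatenations of exactly k nonempty palindromes) can be decided by a brute-force check that tracks which prefix lengths are reachable after each palindrome. This sketch is offline and far from linear time, so it only illustrates the language itself and not the online algorithm of the paper; the name `in_pal_k` is hypothetical.

```python
def in_pal_k(s, k):
    """Return True when s is a concatenation of exactly k nonempty palindromes."""
    n = len(s)

    def is_pal(i, j):
        # Check whether the slice s[i:j] reads the same in both directions.
        t = s[i:j]
        return t == t[::-1]

    # reachable holds every prefix length that splits into the palindromes
    # consumed so far; it starts with the empty prefix.
    reachable = {0}
    for _ in range(k):
        reachable = {
            j
            for i in reachable
            for j in range(i + 1, n + 1)  # j > i keeps each palindrome nonempty
            if is_pal(i, j)
        }
    return n in reachable


if __name__ == "__main__":
    print(in_pal_k("abaab", 2), in_pal_k("abc", 2))
```

    Here "abaab" splits as "a" followed by "baab", so it belongs to Pal^2, while "abc" needs three palindromes and does not.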

    Palindromic Decompositions with Gaps and Errors

    Identifying palindromes in sequences has been an interesting line of research in combinatorics on words and also in computational biology, after the discovery of the relation of palindromes in the DNA sequence with the HIV virus. Efficient algorithms for the factorization of sequences into palindromes and maximal palindromes have been devised in recent years. We extend these studies by allowing gaps in decompositions and errors in palindromes, and also imposing a lower bound to the length of acceptable palindromes. We first present an algorithm for obtaining a palindromic decomposition of a string of length n with the minimal total gap length in time O(n log n * g) and space O(n g), where g is the number of allowed gaps in the decomposition. We then consider a decomposition of the string in maximal $\delta$-palindromes (i.e. palindromes with $\delta$ errors under the edit or Hamming distance) and g allowed gaps. We present an algorithm to obtain such a decomposition with the minimal total gap length in time O(n (g + $\delta$)) and space O(n g). Comment: accepted to CSR 201

    Near-Optimal Computation of Runs over General Alphabet via Non-Crossing LCE Queries

    Longest common extension queries (LCE queries) and runs are ubiquitous in algorithmic stringology. Linear-time algorithms computing runs and preprocessing for constant-time LCE queries have been known for over a decade. However, these algorithms assume a linearly-sortable integer alphabet. A recent breakthrough paper by Bannai et al. (SODA 2015) showed a link between the two notions: all the runs in a string can be computed via a linear number of LCE queries. The first to consider these problems over a general ordered alphabet was Kosolobov (Inf. Process. Lett., 2016), who presented an $O(n (\log n)^{2/3})$-time algorithm for answering $O(n)$ LCE queries. This result was improved by Gawrychowski et al. (accepted to CPM 2016) to $O(n \log \log n)$ time. In this work we note a special non-crossing property of LCE queries asked in the runs computation. We show that any $n$ such non-crossing queries can be answered on-line in $O(n \alpha(n))$ time, which yields an $O(n \alpha(n))$-time algorithm for computing runs.
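
    The two notions in this abstract can be illustrated with a naive sketch: an LCE query answered by direct character comparison, and runs (maximal substrings whose length is at least twice their minimal period) found by issuing LCE queries for every candidate period. This is a brute-force illustration under those definitions, not the method of Bannai et al. nor the algorithm of this paper, and it issues far more than a linear number of queries; all function names are assumptions.

```python
def lce(s, i, j):
    """Length of the longest common prefix of s[i:] and s[j:]."""
    k = 0
    while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k


def smallest_period(t):
    """Smallest p >= 1 such that t[k] == t[k + p] for all valid k."""
    return next(
        p for p in range(1, len(t) + 1)
        if all(t[k] == t[k + p] for k in range(len(t) - p))
    )


def runs(s):
    """Return the set of runs as (start, end, period), with end exclusive."""
    n = len(s)
    found = set()
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            ext = lce(s, i, i + p)
            if ext >= p:
                # s[i : i + p + ext] has period p, length >= 2p, and cannot
                # be extended in either direction while keeping period p.
                start, end = i, i + p + ext
                if smallest_period(s[start:end]) == p:
                    found.add((start, end, p))
            # Position i + ext is a mismatch for period p, so skip past it.
            i += ext + 1
    return found


if __name__ == "__main__":
    print(sorted(runs("mississippi")))
```

    For "mississippi" this reports four runs: "ississi" with period 3 and the three squares "ss", "ss", "pp" with period 1.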

    Run compressed rank/select for large alphabets

    Given a string of length n that is composed of r runs of letters from the alphabet {0, 1, ..., σ-1} such that 2 ≤ σ ≤ r, we describe a data structure that, provided r ≤ n / log^{ω(1)} n, stores the string in r log(nσ/r) + o(r log(nσ/r)) bits and supports select and access queries in O(log(log(n/r) / log log n)) time and rank queries in O(log(log(nσ/r) / log log n)) time. We show that r log(n(σ-1)/r) - O(log(n/r)) bits are necessary for any such data structure and, thus, our solution is succinct. We also describe a data structure that uses (1 + ϵ) r log(nσ/r) + O(r) bits, where ϵ > 0 is an arbitrary constant, with the same query times but without the restriction r ≤ n / log^{ω(1)} n. By simple reductions to the colored predecessor problem, we show that the query times are optimal in the important case r ≥ 2^{log^δ n}, for an arbitrary constant δ > 0. We implement our solution and compare it with the state of the art, showing that the closest competitors consume 31-46% more space. © 2018 IEEE. Peer reviewed
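
    The three queries named in this abstract can be illustrated on a plain run-length representation with binary search over run boundaries. This sketch is neither succinct nor does it achieve the stated query times; it only shows what access, rank and select mean on a string stored as runs. The class name `RunLengthString` and its layout are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from itertools import groupby


class RunLengthString:
    """Run-length encoded string with access, rank and select queries."""

    def __init__(self, s):
        self.n = len(s)
        self.starts = []   # start position of every run
        self.letters = []  # letter of every run
        # Per letter: start positions of its runs, and cumulative
        # occurrence counts (cum[k] = occurrences in its first k runs).
        self.by_letter = {}
        pos = 0
        for letter, group in groupby(s):
            length = sum(1 for _ in group)
            self.starts.append(pos)
            self.letters.append(letter)
            run_starts, cum = self.by_letter.setdefault(letter, ([], [0]))
            run_starts.append(pos)
            cum.append(cum[-1] + length)
            pos += length

    def access(self, i):
        """Letter at position i (0-based)."""
        if not 0 <= i < self.n:
            raise IndexError(i)
        return self.letters[bisect_right(self.starts, i) - 1]

    def rank(self, c, i):
        """Number of occurrences of c among the first i positions."""
        if c not in self.by_letter:
            return 0
        run_starts, cum = self.by_letter[c]
        k = bisect_left(run_starts, i)  # runs of c starting before i
        if k == 0:
            return 0
        run_len = cum[k] - cum[k - 1]
        return cum[k - 1] + min(i - run_starts[k - 1], run_len)

    def select(self, c, j):
        """Position of the j-th occurrence of c (j is 1-based)."""
        run_starts, cum = self.by_letter.get(c, ([], [0]))
        if not 1 <= j <= cum[-1]:
            raise ValueError("no such occurrence")
        k = bisect_left(cum, j)  # the j-th occurrence lies in run k - 1
        return run_starts[k - 1] + (j - cum[k - 1] - 1)


if __name__ == "__main__":
    t = RunLengthString("aaabbaac")
    print(t.access(4), t.rank("a", 6), t.select("a", 4))
```

    For "aaabbaac" the letter at position 4 is "b", the first six positions contain four occurrences of "a", and the fourth "a" sits at position 5.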

    Cold intense electron beams from LN2-cooled GaAs-photocathodes

    To study electron-ion interactions at the Heidelberg heavy-ion storage ring, electron beams with low-energy spreads and dc-currents of milliamperes are desired. Measurements of the photoelectron energy distribution showed that electron beams with energy spreads of 5-8 meV can be obtained from GaAs photocathodes, cooled to about LN2-temperature. However, in order to get milliamperes beam currents, the laser illumination has to be increased up to 1 W, causing substantial cathode heating. The presented new electron gun design based on sapphire-substrate transmission-mode photocathodes, cooled by LN2, stabilizes the GaAs bulk temperature under 1 W laser illumination at about 95 K and thereby provides the prerequisites for an electron gun being operated at milliampere-currents with low-energy spreads